ABSTRACT

A few works address the challenge of automating software
diversification, and they all share one core idea: using automated
test suites to drive diversification. However, there is a lack of
solid understanding of how test suites, programs and transformations
interact with one another in this process. We explore this intricate
interplay in the context of a specific diversification technique
called “sosiefication”.

Sosiefication generates sosie programs, i.e., variants of a program
in which some statements are deleted, added or replaced but which
still pass the test suite of the original program. Our investigation
of the influence of test suites on sosiefication exploits the
following observation: test suites cover the different regions of
programs in very unequal ways. Hence, we hypothesize that sosie
synthesis performs differently on a statement that is covered by one
hundred test cases and on a statement that is covered by a single
test case. We synthesize 24 583 sosies on 6 popular open-source Java
programs. Our results show that there are two dimensions for
diversification. The first one lies in the specification: the more
test cases cover a statement, the more difficult it is to synthesize
sosies. Yet, to our surprise, we are also able to synthesize sosies
on highly tested statements (up to 600 test cases), which indicates
an intrinsic property of the programs we study. The second dimension
is in the code: we manually explore dozens of sosies and characterize
new types of forgiving code regions that are prone to
diversification.

1. INTRODUCTION 

Software diversity, i.e., the availability of multiple variants of a
program that provide the same functionality with different
implementations, is of great interest for software engineering. The
early exploitation of such diversity was for fault-tolerance in
critical software systems. More recently, the existence of multiple,
diverse versions of the same function has been exploited for
survivable architectures, cross-checking oracles, self-adaptation
[12], intrusion detection and multi-level diversification.




As opposed to the exploitation of manually created software
diversity, as in N-version programming, there is a research area on
automatic software diversity, ignited by the seminal works of Cohen
and Forrest. Automatic diversification has been widely explored at
the machine-code level for security purposes, but only a few works
tackle this challenge in application-level source code. They all
share the same core idea: using automated test suites to drive
diversification. In short, the process consists of transforming the
original program to get a variant and of running the test suite to
assess the validity of the variant. However, there is a lack of solid
understanding of how test suites, programs and transformations
interact with one another in this process. This is where our
contribution lies.

In this work, we consider a specific diversification technique called
“sosiefication”. Sosiefication creates sosie programs, which are
variants of a program in which some statements are deleted, added or
replaced but which still pass the test suite of the original program.
Our intuition is that the test suite of a program, the basis for all
recent works on automatic diversification, covers the different
regions of the program in very unequal ways, and that this has an
impact on sosiefication. We hypothesize that synthesizing a sosie on
a statement that is covered by one hundred test cases is different
from synthesizing a sosie on a statement that is covered by a single
test case. The difference lies in the ease of synthesis and in the
quality of the resulting sosie. This is what we explore in this
paper.

Technically, we synthesize 24 583 sosies on 6 popular open-source
Java programs that are available with very solid JUnit test suites.
For each of them, we compute an “execution signature” per statement,
a short expression that refers to the number of test cases that
execute a given code region. We consider this metric as a proxy for
the “amount of specification” - so to speak - of this region. We show
that this metric varies greatly among the statements inside a
program. We use this metric as a guiding light for our investigation
of the mechanisms that underlie sosiefication.

We show that there is a relation between execution signatures and the
efficiency of the sosiefication process: the more a statement is
tested, the more difficult it is to synthesize sosies. However, to
our surprise, we are still able to synthesize sosies on highly tested
statements (up to 600 test cases). To us, this indicates an intrinsic
property of the software subjects under study.

In addition to a quantitative analysis of the sosiefication process,
we perform a qualitative investigation of sosiefication via manual
assessment. We propose a first categorization of sosies, where each
category relates to a specific kind of code region (e.g.,
optimization code). This extends the body of knowledge about
forgiving code regions. In particular, we find regions characterized
by “plastic specifications”, i.e., regions which are governed by a
very open yet strong contract. For instance, the only correctness
contract of a hashing function is to be deterministic. On the one
hand, this is a strong contract. On the other hand, it is very open:
many variants of a hashing function are valid, and consequently, many
modifications in the code result in valid hashing functions.
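The hashing example can be made concrete with a small sketch. It is not taken from the studied programs: the class `PlasticHashDemo`, the multipliers 31 and 37, and the `BucketSet` container are hypothetical names and values chosen for illustration. The two hash functions return different values, yet a client that only relies on determinism observes the same behavior with either one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical illustration: a bucket set whose hash function is pluggable. */
final class PlasticHashDemo {

    /** "Original" polynomial hash (multiplier 31). */
    static int hashOriginal(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    /** Variant with another multiplier: different values, still deterministic. */
    static int hashVariant(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 37 * h + s.charAt(i);
        }
        return h;
    }

    /** Minimal set of strings built on top of an arbitrary hash function. */
    static final class BucketSet {
        private final List<List<String>> buckets = new ArrayList<>();
        private final ToIntFunction<String> hash;

        BucketSet(ToIntFunction<String> hash, int size) {
            this.hash = hash;
            for (int i = 0; i < size; i++) {
                buckets.add(new ArrayList<>());
            }
        }

        private List<String> bucketOf(String s) {
            // floorMod keeps the index non-negative even if the hash overflows
            return buckets.get(Math.floorMod(hash.applyAsInt(s), buckets.size()));
        }

        void add(String s) {
            if (!bucketOf(s).contains(s)) {
                bucketOf(s).add(s);
            }
        }

        boolean contains(String s) {
            return bucketOf(s).contains(s);
        }
    }

    public static void main(String[] args) {
        BucketSet a = new BucketSet(PlasticHashDemo::hashOriginal, 16);
        BucketSet b = new BucketSet(PlasticHashDemo::hashVariant, 16);
        a.add("sosie");
        b.add("sosie");
        // Both sets answer membership queries identically.
        System.out.println(a.contains("sosie") == b.contains("sosie"));
    }
}
```

A test suite that only checks membership cannot distinguish the two variants, which is what makes such a region a natural target for diversification.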

We believe that our findings based on a specific diversification
technique - sosiefication - can be exploited for other
diversification approaches. We provide novel insights about two
dimensions of diversification. First, we shine a spotlight on the
existence of plastic parts in program specifications. The literature
has already identified some, e.g., video compression, and in this
paper we reveal a new one based on hashing functions. We are
convinced that there are many other such plastic specifications, and
future research should build a comprehensive catalog of plastic
behavior.

The second dimension is in the code. The forgiving regions of the
code are those that can be easily modified while maintaining
acceptable behavior. Often, the implementations of plastic
specifications are forgiving (such as the implementation of a video
codec). However, this is not a bijection. In our manual analysis, we
have encountered forgiving statements in zones that are very
conventionally binary in their specification. There is a need for
research on the intersection of plastic behavior and forgiving
regions.

To sum up, the contributions of the paper are:

• an empirical analysis of the interplay between programs and their
test suites that demonstrates the wide variety of execution
signatures

• quantitative evidence of the relation between the uneven coverage
of statements and the opportunities for automatic program
transformations

• a deeper understanding of forgiving code regions that can be
exploited for sosiefication as well as for other forms of automatic
diversification (as targets for automatic transformation).

The paper is organized as follows. Section 2 presents a preliminary
analysis that demonstrates uneven coverage of different regions of a
program by its test suite. Section 3 recalls the essentials about
sosie synthesis, as well as our experimental protocol. Section 4
presents and discusses our main findings about the interplay between
a test suite, a program and the opportunities for sosie synthesis.
The final sections outline the related work and conclude.

2. A PRELIMINARY STUDY ABOUT STATEMENT EXECUTION SIGNATURES

In this paper, we are interested in how test suites, programs and
transformations interact. In this section we explore the relation
between the first two: test suites and programs. We perform a
preliminary experiment about the interplay between the test suite of
a program and its statements. We consider projects written in Java
and coming with a JUnit test suite. In these projects, test code is
clearly separated from application code and each test case includes
one or more method calls, and one or more assertions that express the
expected properties of the program’s behavior.


2.1 Collecting Statement Execution Signatures 

We have developed a tool, called SESig, which collects fine-grained
metrics about how the statements in a Java program are covered by a
test suite. It collects the following metrics about each statement
$s$: 1) the number of test cases in the test suite that cover $s$;
2) the execution depth of $s$. We associate a vector $Depth_s$ to
each statement such that, given the set $\{t_1, \ldots, t_n\}$ of
test cases that cover $s$,
$Depth_s = [depth(s, t_i)]_{i \in \{1..n\}}$, where $depth(s, t_i)$
is the depth of $s$ in the call stack when running $t_i$.
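As an illustration of what such a signature holds, here is a minimal sketch of a collector. It is not SESig's code: the class `ExecutionSignatures`, its method names and the string identifiers for statements and test cases are assumptions made for this example. A statement probe is assumed to call `record` with the id of the running test case and the current call-stack depth.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of per-statement execution signatures (not SESig itself). */
final class ExecutionSignatures {

    // statement id -> (test case id -> depths at which that test executed the statement)
    private final Map<String, Map<String, Set<Integer>>> hits = new HashMap<>();

    /** Called by a statement probe with the running test case and call-stack depth. */
    void record(String statementId, String testId, int depth) {
        hits.computeIfAbsent(statementId, k -> new HashMap<>())
            .computeIfAbsent(testId, k -> new TreeSet<>())
            .add(depth);
    }

    private Map<String, Set<Integer>> byTest(String statementId) {
        Map<String, Set<Integer>> m = hits.get(statementId);
        if (m == null) {
            m = new HashMap<>();
        }
        return m;
    }

    /** Number of distinct test cases that cover the statement. */
    int coveringTests(String statementId) {
        return byTest(statementId).size();
    }

    /** Sorted distinct depths observed for the statement, over all covering tests. */
    Set<Integer> depths(String statementId) {
        Set<Integer> all = new TreeSet<>();
        for (Set<Integer> perTest : byTest(statementId).values()) {
            all.addAll(perTest);
        }
        return all;
    }

    /** Renders the signature as "(count,[depths])". */
    String signature(String statementId) {
        return "(" + coveringTests(statementId) + ","
            + depths(statementId).toString().replace(" ", "") + ")";
    }

    public static void main(String[] args) {
        ExecutionSignatures sig = new ExecutionSignatures();
        sig.record("stmt-A", "testOne", 1);
        sig.record("stmt-A", "testTwo", 2);
        System.out.println(sig.signature("stmt-A"));
    }
}
```

The sketch keeps the distinct depths per test case, so both the number of covering test cases and the spread of depths can be derived from the same structure.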

For example, let us consider the method append from commons.lang
3.3.2 (Listing 1). SESig collects the following information. The
method is executed by 28 different test cases and all statements of
the method but one (the first return, annotated (0,[])) are covered.
Most statements are executed by one test case only, except the two
statements in lines 13 and 15 (the nested append call and the final
return), which are executed by 24 and 25 different test cases
respectively. We also observe that most statements are executed at
depth 1, except the ones in the length-mismatch branch, which are
executed only at depth 6. Listing 2 shows the stack trace when
stopping on these statements: we clearly see that they are not
directly exercised by a test case. The statements in lines 13 and 15
appear at different depths, indicating that the different test cases
that cover them trigger these behaviors in different contexts.

Listing 1: The append method from EqualsBuilder in commons.lang

    public EqualsBuilder append(final boolean[] lhs, final boolean[] rhs) {
      if (isEquals == false) {
        return this;}                              //(0,[])
      if (lhs == rhs) {
        return this;}                              //(1,[1])
      if (lhs == null || rhs == null) {
        this.setEquals(false);                     //(1,[1])
        return this;}                              //(1,[1])
      if (lhs.length != rhs.length) {
        this.setEquals(false);                     //(1,[6])
        return this;}                              //(1,[6])
      for (int i = 0; i < lhs.length && isEquals; ++i) {
        append(lhs[i], rhs[i]);                    //(24,[1,2,5,6])
      }
      return this;}                                //(25,[1,2,3,5,6])


Listing 2: Stack trace when stopping at a statement executed at depth 6

    EqualsBuilder.append:899
    EqualsBuilder.append:487
    EqualsBuilder.reflectionAppend:411
    EqualsBuilder.reflectionEquals:360
    EqualsBuilder.reflectionEquals:295
    DiffBuilder.<init>:111
    DiffBuilderTest.testBooleanArray:110

SESig adds probes in the test suite and the program, at the following
locations: entrance and exit of a test case, entrance and exit of
methods in the program, bifurcation of branches inside a method, and
each statement in the program (this latter probe collects the depth
of the statement in the call stack and the id of the test case that
is currently running). The tool is publicly available as open source.

2.2 Empirical Observations 

We now explore the test suite execution at the level of an entire
project. Figure 1 displays the signatures of all statements in Apache
Commons Lang that are covered by at least one test case. Each point
is a statement and its position indicates the number of test cases
that cover it (X-axis) and its median depth in the call stack when
running the test cases (Y-axis).

Figure 1: Interplay between program statements and test suites in Apache Commons Lang. A point is a statement covered
by n test cases, where n is the X-axis. The Y-axis is the mean depth of the statement when running the whole test suite.

Note: When counting the depth in the call stack, we ignore calls to
external libraries.

The x-axis captures the disparity in terms of coverage, 
summarized in a boxplot at the top of the figure: some 
statements are covered by no more than one test case (4243 
statements), while some others are covered by hundreds of 
test cases (77 statements are covered by more than 100 test 
cases). Yet, test coverage is very skewed towards low values: 
25% of the statements are covered by a single test case and 
50% are covered by one or two test cases. 

The y-axis captures the disparity in the relative position of a
statement in the execution flow of a test suite: a majority of
statements are executed close to the test case (at a depth lower than
5), while some others appear much deeper and are most probably tested
only as a side-effect of testing other methods. For example, the
statements in the length-mismatch branch of Listing 1 appear at a
depth of 6 calls in the stack and are not the main testing target of
the single test case that covers them. What clearly appears here is
that a vast majority of statements appear quite close to the test
cases that cover them (75% of statements have a median depth below
2.5).

We manually looked at the extreme cases. The statements that appear
very deep in the stack (more than 13, in the top part of Figure 1)
are statements in recursive methods. These have a high median depth
value and also a very large variance in their depth value: all of
them happen to be actually tested at depth 1 as well as at depths
above 30. Looking at the statements that are covered by many test
cases (on the right of the plot), we remark that they are also always
at a median depth greater than 1. These statements are mostly in
utility methods that are used by many other methods, hence all of
them are both directly tested and indirectly tested through the test
cases of client code (e.g., the right-most statements are all in the
ToStringStyle class).

We performed the analysis for the other programs that are used later
in this paper and presented in Table 1. All plots are available
online. The maximum values for the number of covering test cases and
median depth vary from one project to the other: the most covered
statement of commons.codec is covered by 105 test cases, while the
most covered statement of commons.collections is covered by 1780 test
cases; the median depth varies from 1 to 8 in commons.io and from 1
to 1863 in Gson. Yet, some major trends are observed in all projects:
(i) statements are always very unequally covered by the test suite;
(ii) 50% of the statements are covered by a small number of test
cases: this number varies between 2 (as in the case of lang) and 11
(in Gson); (iii) the statements that appear very deep in the
execution stack are always in recursive methods (the most extreme
cases were observed in Gson, where some statements appeared as deep
as 3692); and (iv) the statements that are covered by a large number
of test cases occur at a mean depth greater than 2 because they are
in utility methods or in private methods, hence mostly executed by
methods in the program rather than directly by the test cases.


To sum up, this preliminary study suggests that statements have very
different execution signatures. Our intuition is that we can leverage
these large variations among signatures to characterize the interplay
between test suites and program statements for software
diversification.


3. ANALYSIS OF A DIVERSIFICATION TECHNIQUE

We have observed that the interplay between a test suite and the
statements of the program under test produces very different
statement signatures. Our goal is now to relate these statement
signatures to a particular diversification technique: sosiefication.

3.1 Sosie synthesis 

Sosiefication is the process of synthesizing sosies. We introduced it
in our previous work on software diversity.

The word sosie is a French word that literally means “look-alike”.

Definition 1. Sosie (noun). Given a program P, a test suite TS for P
and a program transformation T, a variant P' = T(P) is a sosie of P
if the two following conditions hold: 1) the part of P that is
modified by T is covered by at least one test case; 2) all test cases
in TS pass on P'.

Given an initial program, we synthesize sosies with source code
transformations that modify the abstract syntax tree (AST). We
consider three types of transformation that manipulate statement
nodes of the AST: 1) remove a node from the AST (Delete); 2) add a
node just after another one (Add); 3) replace a node by another one
from the same AST (Replace). We call the transplantation point the
statement on which we perform a transformation. For add and replace,
we also refer to the transplant statement that is copied and
inserted. The transplantation point and the transplant are in the
same AST (we do not synthesize new code, nor take code from other
programs).
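The three transformations can be sketched on a deliberately simplified model in which a block of code is a flat list of statements. This is an assumption made for illustration only: the class `StatementTransformations` and its methods are hypothetical and operate on strings, not on the AST nodes handled by the actual tooling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model: a block is a list of statements (strings), not a real AST. */
final class StatementTransformations {

    /** Delete: removes the statement at the transplantation point. */
    static List<String> delete(List<String> block, int point) {
        List<String> variant = new ArrayList<>(block);
        variant.remove(point);
        return variant;
    }

    /** Add: inserts a copy of the transplant just after the transplantation point. */
    static List<String> add(List<String> block, int point, int transplant) {
        List<String> variant = new ArrayList<>(block);
        variant.add(point + 1, block.get(transplant));
        return variant;
    }

    /** Replace: overwrites the transplantation point with a copy of the transplant. */
    static List<String> replace(List<String> block, int point, int transplant) {
        if (point == transplant) {
            throw new IllegalArgumentException("a statement cannot be replaced by itself");
        }
        List<String> variant = new ArrayList<>(block);
        variant.set(point, block.get(transplant));
        return variant;
    }

    public static void main(String[] args) {
        List<String> block = Arrays.asList("a();", "b();", "c();");
        System.out.println(delete(block, 1));
        System.out.println(add(block, 0, 2));
        System.out.println(replace(block, 0, 2));
    }
}
```

Each method returns a new variant and leaves the original block untouched, mirroring the fact that every trial starts from the original program and that transplants are always copied from the same program.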

Sosiefication consists in randomly picking an AST statement node and
trying to apply the three transformations. Yet, for replace and add,
we introduce some constraints. First, a statement cannot be replaced
by itself; AST nodes of type case, variable declaration, return and
throw are only replaced by statements of the same type; the type of
the value returned by a return statement must be the same for the
original and new statement. Second, we consider transplant statements
that manipulate variables of the same type as the transplantation
point, and we rename the variables of the transplant with names of
variables of the corresponding type, which are in the namespace of
the transplantation point. We call these Steroid transformations.

Since the sosiefication process consists in applying a transformation
on a program and then running the test suite to select sosies, it can
look similar to mutation testing. Sosies might even be thought of as
equivalent mutants. Yet, both approaches are conceptually different:
program transformations for mutation testing are designed according
to fault models, while the sosiefication transformations are designed
to explore the neighbourhood of similar programs; mutation testing
aims at assessing the ability of a test suite to detect the injected
bugs, while sosiefication aims at synthesizing variants of a program
that exhibit a form of diversity. Also, we have shown that, as
opposed to equivalent mutants, sosies can behave differently from the
original and produce different results under certain conditions (we
illustrate more examples in Section 4.3).


3.2 Metrics 

We now present a metric that characterizes the sosiefication process,
as well as the features that characterize a transplantation point at
which sosiefication can be applied.


Definition 2. Sosiefication Rate (SR) is the ratio between the number
of sosies (variants that pass the test suite) and the total number of
transformations done, one transformation being a trial to produce a
program variant: $SR = \#Sosies / \#Trials$.
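Definition 2 translates directly into code. The sketch below is only an illustration: the class `SosieficationRate` and the `Outcome` enum are hypothetical names, and returning 0 when no trial was made is a convention chosen here, not something the definition specifies.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper computing the sosiefication rate from trial outcomes. */
final class SosieficationRate {

    /** Possible outcomes of one trial (one transformation applied to the program). */
    enum Outcome { COMPILE_FAILURE, TEST_FAILURE, SOSIE }

    /** SR = #Sosies / #Trials; returns 0 when no trial was made (convention chosen here). */
    static double rate(List<Outcome> trials) {
        if (trials.isEmpty()) {
            return 0.0;
        }
        long sosies = trials.stream().filter(o -> o == Outcome.SOSIE).count();
        return (double) sosies / trials.size();
    }

    public static void main(String[] args) {
        List<Outcome> trials = Arrays.asList(
            Outcome.SOSIE, Outcome.COMPILE_FAILURE,
            Outcome.TEST_FAILURE, Outcome.TEST_FAILURE);
        // 1 sosie out of 4 trials
        System.out.println(rate(trials)); // prints 0.25
    }
}
```

Variants that do not compile count as trials, so a transformation strategy that produces many uncompilable variants lowers the rate just as failing tests do.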


Sosiefication is an expensive process, which uses a lot of
computation power. From an engineering perspective, it is good to
generate as many sosies as possible in any given amount of time. To
this end, it is better to maximize the sosiefication rate.

Our goal is to explore the relations between transplantation points
and the sosiefication rate. For instance, we are especially
interested in the transplantation point features that maximize the
sosiefication rate. We focus on the following features to
characterize transplantation points.


Definition 3. Transplantation point features: Let us call $\tau$ the
transplantation point yielding the sosie. We focus on the following
features: 1) $TC_\tau$ is the number of test cases that execute
$\tau$. 2) $Transfo_\tau$ is a categorical feature that characterizes
the type of transformation that we performed on $\tau$: add, delete
or replace. This can be further refined by considering the type of
AST node where the transformation is applied.

The collection of all those features is implemented in a tool that is
publicly available.

3.3 Experimental Protocol 

In this paper, we perform the following experiment. For a set of
programs considered as a dataset (presented in Table 1), we
synthesize a set of sosies. For this, we use the “Steroid” strategy
described in Section 3.1. This process is budget-based: we try
neither to exhaustively visit the search space nor to have a
fixed-size sample. Since sosiefication is an expensive process, our
computation platform is Grid5000, a scientific platform for parallel,
large-scale computation. We submit one batch for each program, which
runs as long as resources (CPU and memory) are available on the grid.
Then, for each sosie, we extract or compute the metrics described in
the previous section. We also manually analyze dozens of sosies in
order to build a taxonomy of sosies.


Table 1: Descriptive statistics about our subject programs

Program                    #classes   #stmt    #TC      cov.
commons-lang 3.3.2         132        8442     2352     94%
commons-collections 4.0    286        6780     13677    84%
commons-codec 1.10         60         2695     662      96%
commons-io 2.4             103        2573     962      87%
Gson 2.3.2                 66         2377     951      79%
jgit 3.7.0                 666        22333    2758     70%


We consider the 6 programs presented in Table 1. All programs are
popular Java libraries developed by the Apache foundation, Google or
Eclipse. The second column gives the number of classes, the third
column the number of statements. Column 4 provides the number of test
case executions when running the test suite and column 5 gives the
statement coverage rate.

The programs range between 60 and 666 classes. All of them are tested
with very large test suites that include hundreds of test cases that
execute the program in many different situations. One can notice the
extremely high number of test cases executed on commons-collections.
This results from an extensive usage of inheritance in the test
suite, hence many test cases are executed multiple times (e.g., test
cases that test methods declared in abstract classes). The test
suites cover most of the program (up to 96% statement coverage for
commons-codec). Jgit is the exception (only 70% coverage): it
includes many classes meant to connect to different remote git
servers, which are not covered by the unit test cases (due to the
difficulty of stubbing these servers). This dataset provides a solid
basis to investigate the interplay between test suites and
sosiefication.

3.4 Research Questions 

We contribute to the exploration of two general problems of software
diversification: how to effectively synthesize diverse software? what
property of software should be searched for and exploited for the
sake of diversification? The following research questions are
contributions in this direction.
3.4.1 RQ1: Is the sosiefication rate SR higher for statements that
are less tested (in terms of number of test cases)?


One criticism often made about techniques that rely on test suites to
automatically transform programs [14, 15] is that test suites are not
strong enough to ensure the validity of variants. The intuition
behind this criticism is that if a program is badly tested, it is
easy to generate variants of the program that still pass the test
suite. This research question investigates to what extent this is
true for sosie synthesis, by comparing the sosiefication rates on
poorly tested regions with the rates on highly tested regions.


3.4.2 RQ2: What is the relation between the types of transplantation
points and transplants and the test suite execution?

We would like to understand the interplay between the transformation
operators and the test suite. For instance, it may happen that
if-conditions are better specified than method calls. This has a
direct impact on sosiefication: while transformations on
if-conditions may yield a higher sosiefication rate, the resulting
sosies may also be of worse quality. There are three dimensions in
the qualification of transformations: 1) how they are applied
(addition of new code versus deletion of existing code); 2) where
they are applied, i.e., the transplantation points (e.g., ifs versus
method calls); and 3) for addition and replacement, the type of the
transplant. This research question studies those three dimensions.


3.4.3 RQ3: What are the different kinds of good sosies that we can
generate?

In our experience, certain sosies are really interesting, and others
are “bad”. The bad ones are those that are obviously incorrect. These
sosies pass the test suite, by construction, but they happen in parts
that are loosely specified.

Meanwhile, our experience also showed that there exist different
kinds of good sosies, e.g., sosies that introduce true diversity in
the computation and not merely bugs. This research question relies on
the manual analysis of dozens of sosies from all programs of our
dataset, to build a taxonomy of good program sosies.


4. EMPIRICAL RESULTS 

We apply our experimental protocol on 6 Java programs. Table 2 gives
the key data about the sosies computed with the budget-based approach
described in Section 3.3. The second column indicates the number of
sosies we generated for each program. The third column indicates the
global sosiefication rate (SR), i.e., among all variants that we
generated, how many were actual sosies (the other variants either do
not compile or fail at least one test case). The next columns
indicate the number of sosies synthesized by adding, deleting or
replacing statements. The last column indicates the rate of
statements for which we generated variants, i.e., the number of
statements that served as transplantation points over all statements.
This last metric provides an indication of how much we tried to
sosiefy in all regions and thus to what extent we can exploit the
findings of Section 2 to investigate the sosies. The low rate for
jgit is related to the large size of this project: since
sosiefication has a bounded resource budget, we cannot cover a large
program as much as a small one.


Table 2: The Sosie Programs Considered in our Empirical
Investigations

Program       #sosies   sosief. rate (SR)   add     del     rep     expl. rate
lang          1146      9.6%                419     190     537     78%
collections   8626      10.8%               3912    754     3960    83.3%
codec         701       10.4%               289     146     266     91.9%
io            3545      13.9%               1754    319     1472    92%
Gson          4311      14%                 2199    215     1897    80.3%
jgit          6262      16%                 1924    1375    2963    57%


4.1 RQ1: Relation between Statement Execution Signature and
Sosiefication

We try to apply one or more transformations at each transplantation
point, in order to create sosie programs. Each trial produces a
program variant, which either fails to compile, fails to pass the
test suite, or is a sosie, and we then compute the sosiefication rate
(cf. Definition 2) at each transplantation point. Since a
transplantation point is a statement, we use SESig to retrieve the
number of test cases that cover it.

We analyze the cumulative sosiefication rate at transplantation
points covered by a given number of test cases. Figure 2 provides
this data as scatter plots. We have removed the outliers
(sosiefication rates that are too high due to the degenerate cases
discussed below). It contains 6 subfigures, one per project of our
dataset. For instance, the first figure is for Apache Commons IO.
This program includes 845 statements covered by a single test case,
756 of which are potential transplantation points for trying sosie
synthesis. The cumulative sosiefication rate for these points is 15%:
we performed a total of 10959 trials on the 756 points and 1644 were
actual sosies.
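The cumulative rate plotted for each number of covering test cases can be sketched as a simple aggregation. The class `RateByCoverage` and its methods are hypothetical names introduced for this illustration, and returning NaN when no transplantation point has a given coverage is a choice made here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregation of trials by the number of test cases covering the point. */
final class RateByCoverage {

    // number of covering test cases -> {trials, sosies}
    private final Map<Integer, long[]> counts = new TreeMap<>();

    /** Adds the trials made on one transplantation point covered by coveringTests tests. */
    void addTrials(int coveringTests, long trials, long sosies) {
        long[] c = counts.computeIfAbsent(coveringTests, k -> new long[2]);
        c[0] += trials;
        c[1] += sosies;
    }

    /** Cumulative rate over all points with this coverage; NaN when there is no data. */
    double rate(int coveringTests) {
        long[] c = counts.get(coveringTests);
        return (c == null || c[0] == 0) ? Double.NaN : (double) c[1] / c[0];
    }

    public static void main(String[] args) {
        RateByCoverage io = new RateByCoverage();
        // Totals reported in the text for Commons IO points covered by one test case.
        io.addTrials(1, 10959, 1644);
        System.out.println(io.rate(1));
    }
}
```

Fed with the Commons IO totals quoted above (10959 trials, 1644 sosies), the rate for one covering test case is about 0.15, i.e., the 15% reported in the text.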

In the top right corner of each plot, we also include a zoom on the
left-hand side of the distribution (e.g., from 1 to 20 test cases for
Commons IO). The rationale for this zoom is that a vast majority of
the statements - hence transplantation points - are on the left (as
shown in Section 2, the distribution of statement coverage is highly
skewed towards low values), and this is also where we performed the
highest numbers of trials.

This data can be interpreted as follows. First, for all projects, the
sosiefication rate tends to decrease with the number of test cases.
The slope of the decrease varies between 4 × 10^(-…) and 7 × 10^(-…)
for global plots. This is a variation of three orders of magnitude.
The slope itself is low because the X-axis is an absolute number of
test cases going up to 10^(…) while the Y-axis is by construction
between 0 and 1. The general tendency to decrease can be explained by
the fact that more test cases mean more testing scenarios and more
assertions, which leaves less space for unspecified behavior. Since
the sosiefication process heavily explores this space by
construction, more test cases directly result in a lower
sosiefication rate. In other words, the increase in specification
quality yields fewer sosies (the buggy program variants being
killed). Interestingly, the decrease in the zooms, i.e., for the
poorly tested statements, is steeper, with slopes ranging from
2 × 10^(-…) to 4 × 10^(-…). This can be interpreted as an
accentuation of the “specification quality improvement” phenomenon on
the left part of the plot: we believe that, in terms of behavioral
specification, the “amount of additional specification” of a given
statement is generally higher between one and two test cases than
between 600 and 601. Here, the unconventional expression “amount of
additional specification” refers to new contracts, new corner cases,
etc.

Figure 2: Distribution of sosiefication rate w.r.t. coverage of the transplantation point: one point on a plot represents the
global sosiefication rate at transplantation points covered by a given number of test cases. Each plot includes the best possible
linear regression. In the top right corner of each plot, we also include a zoom on the left-hand side of the distribution (e.g.,
from 1 to 20 test cases for Commons IO).

Second, one sees that the right-hand side of the distribution is very
irregular. For instance, for Apache Commons Collections, we see
several spikes from 0 to 0.4 among points above 30 test cases. This
can be explained by several factors. The main one is that the
sosiefication rate - a ratio - has degenerate cases. One degenerate
case is the absence of data: for instance, there is no statement that
is covered by exactly 131 test cases in Apache Commons Collections.
Another degenerate case is when there is too little data. For
instance, in Gson, there is one single statement which is covered by
372 test cases. By chance, the variant made on this statement is a
sosie. Consequently, for n=372 test cases, the sosiefication rate is
100%. However, the average sosiefication rate for hundreds of test
cases is nowhere near 100%. This case is clearly an outlier, due to
the limited amount of data (as we saw in Section 2, there is only a
limited number of statements covered by many test cases).

Beyond this graphical interpretation, we have performed 
the following statistical test. For each project, we have 
manually selected a threshold separating low-tested 
transplantation points from high-tested transplantation points. 
This project-dependent threshold (io: 20, codec: 27, lang: 28, 
collection: 33, gson: 28, jgit: 21) corresponds to the 
thumbnails, which show the low-tested points that are below the 
threshold. This yields two different sosiefication rates: the 
rate for low-tested transplantation points and the rate 
for high-tested ones. Since a rate is a proportion, we can 
perform a standard equality-of-proportion test, as implemented 
by ‘prop.test’ in R. For 4 of the 6 projects, the null hypothesis 
(“the sosiefication rates are the same”) is rejected with 95% 
confidence. For io and lang, with respective p-values of 0.4 
and 0.08, there is not enough data to reject the null 
hypothesis. 
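
For readers who want to reproduce the idea outside R, the following Java sketch computes a two-sample chi-squared statistic with Yates' continuity correction and compares it to the 95% critical value for one degree of freedom. It is a sketch under assumptions: we assume the default settings of ‘prop.test’ were used, the class and method names are hypothetical, and the sketch returns a reject/keep decision rather than a p-value.

```java
/** Hypothetical sketch of a two-sample equality-of-proportions test (chi-squared, 1 d.o.f.). */
final class EqualProportionsSketch {
    /** Critical value of the chi-squared distribution with 1 degree of freedom at alpha = 0.05. */
    static final double CHI2_CRITICAL_95 = 3.841458820694124;

    /** Chi-squared statistic with Yates' continuity correction for a 2x2 table. */
    static double chiSquared(long sosies1, long trials1, long sosies2, long trials2) {
        double total = trials1 + trials2;
        double pooled = (sosies1 + sosies2) / total;
        if (pooled == 0.0 || pooled == 1.0) {
            return 0.0; // no variation at all: the two rates are trivially equal
        }
        // With fixed margins, |observed - expected| is identical in the four cells.
        double diff = Math.abs(sosies1 - trials1 * pooled);
        double corrected = diff - Math.min(0.5, diff);
        double sumInverseExpected =
                1.0 / (trials1 * pooled) + 1.0 / (trials1 * (1.0 - pooled))
              + 1.0 / (trials2 * pooled) + 1.0 / (trials2 * (1.0 - pooled));
        return corrected * corrected * sumInverseExpected;
    }

    /** True when "the sosiefication rates are the same" is rejected with 95% confidence. */
    static boolean rejectsEqualRates(long sosies1, long trials1, long sosies2, long trials2) {
        return chiSquared(sosies1, trials1, sosies2, trials2) > CHI2_CRITICAL_95;
    }
}
```

A call such as rejectsEqualRates(lowSosies, lowVariants, highSosies, highVariants) would be made once per project, with the counts taken on each side of the project threshold.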

The third finding is that there is no project for which the 
sosiefication rate clearly tends towards zero. In other terms, 
our data suggests that whatever the amount of specification, 
our code transformations still produce program variants that 
are sosies. We explain this by the presence of software 
plasticity, a concept that we introduce in this paper and for 
which we propose a first characterization. 

We define software plasticity as the ability of software 
modules to have different behaviors while still remaining 
correct. Software plasticity is very much related to Rinard’s 
work, where the transplantation points happen to be in 
“forgiving regions” of code. 

To some extent, the sosiefication rate when the number 
of tests is high reflects this amount of software plasticity. It 
may even be the very first quantitative measure of it. If we 
put several data points in bins, we smooth the irregularities 
shown in Figure 2. This results in an overall sosiefication 
rate of 10% for GSon. In Rinard’s terms, the sosiefication 
rate obtained with our protocol suggests that 10% of GSon 
consists of forgiving regions. 
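
The binning step can be sketched as follows, under the assumption that a bin pools the raw counts of all coverage levels it contains (the binning procedure is not detailed here, so this is only one plausible reading). All names are hypothetical.

```java
import java.util.List;

/** Hypothetical sketch: pooling data points into a bin to smooth single-point outliers. */
final class BinnedSosieRate {
    /** One data point: variants tried and sosies found at one coverage level. */
    static final class Point {
        final int testCases;
        final int variants;
        final int sosies;

        Point(int testCases, int variants, int sosies) {
            this.testCases = testCases;
            this.variants = variants;
            this.sosies = sosies;
        }
    }

    /** Pooled rate over all points with coverage in [from, to]; NaN if the bin is empty. */
    static double pooledRate(List<Point> points, int from, int to) {
        long variants = 0;
        long sosies = 0;
        for (Point p : points) {
            if (p.testCases >= from && p.testCases <= to) {
                variants += p.variants;
                sosies += p.sosies;
            }
        }
        return variants == 0 ? Double.NaN : (double) sosies / variants;
    }
}
```

Pooling counts, rather than averaging per-level rates, prevents a level with a single variant from weighing as much as a level with hundreds.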
Figure 3: The sosiefication rate for add transformations, 
according to the type of the transplant. 

Figure 4: The sosiefication rate for the delete 
transformations, according to the type of the 
transplantation point. 


Answer to RQ1: Transplantation points covered by few 
test cases are easier targets for sosiefication. However, 
the sosiefication rate never goes down to zero. To us, 
this hints at an intrinsic property of the software subjects 
under study. We hypothesize that this property is the 
presence of software plasticity and forgiving regions. 


4.2 RQ2: Relation between Transplantation 
Points, Transplants and Test Suite 

We now look at whether the different types of program 
elements (i.e. types of AST nodes) are specified equally. 
Hence, we compute the sosiefication rate per AST node type. 

We start with the sosiefication operator “delete” (based on 
the number of sosies given in the corresponding table). 
Figure 4 provides the sosiefication rate with delete 
transformations according to the type of the transplantation 
point. This shows that there is a large variation in the 
sosiefication rate per node type. For instance, this figure 
suggests that method invocations are less specified than 
while-blocks, since their sosiefication rate is higher. 

Considering the sosiefication operator “add”, Figure 3 
provides the sosiefication rate according to the type of the 
added code, i.e. the transplant (and not the transplantation 
point). We see that there are also large variations between 
node types as well as between projects. However, some 
regularities emerge: for instance, adding a return always 
yields a low sosiefication rate. Along the lines of RQ1, this 
means that “return” nodes are widely specified. This matches 
the intuition that most assertions in test suites are made on 
returned values just after the computation. 

However, those two figures can be interpreted from a 
different viewpoint. Let us consider again Figure 4, about the 
sosiefication rate for delete transformations. We can see that 
the deletion of continue nodes is always the most effective 
for sosiefication. Those nodes are usually used as shortcuts 
in the computation, hence removing them yields slower yet 
acceptable program variants; we discuss this in depth in the 
next section. We also observe a good sosiefication rate for 
the deletion of method invocations. We explain this effect by 
the presence of side-effect free methods, which can be safely 
removed, and by the existence of many redundant calls (both 
are discussed in the next section). 

The same alternative viewpoint can be taken on code 
addition. Looking more closely at Figure 3, we realize that for 
all projects, the addition of assignment nodes is the most 
effective. This can be explained by the fact that there are 
many places in the code where the variable declaration and 
the first value assignment for this variable are separated by 
a few statements. In these situations it is possible to assign 
any arbitrary value to the variable, which will be cancelled 
by the subsequent assignment. Yao and colleagues observed 
a similar phenomenon of specific assignments that “squeeze 
out” a corrupted state. Also, for some projects such as 
commons-io and jgit, the addition of method invocations is 
quite effective. Similarly to deletion, it probably indicates 
a non-negligible proportion of side-effect free methods 
in the program. The addition of conditionals and loops is 
also effective. It is important to understand that a large 
number of these additional blocks have conditions such that 
the execution never enters the body of the block. 
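
The two situations can be illustrated with a small Java sketch. The methods below are hypothetical and were written for this illustration; they are not taken from the studied projects.

```java
/** Hypothetical sketch: added statements that cannot change the observable result. */
final class HarmlessAdditionSketch {
    /** Original: declaration and first assignment are separated by other statements. */
    static int original(int[] values) {
        int sum;
        int count = values.length; // unrelated statement between declaration and assignment
        sum = 0;
        for (int i = 0; i < count; i++) {
            sum += values[i];
        }
        return sum;
    }

    /** Variant 1: an added assignment is cancelled by the subsequent one. */
    static int withAddedAssignment(int[] values) {
        int sum;
        sum = 42;                  // added transplant: arbitrary value
        int count = values.length;
        sum = 0;                   // original assignment overwrites the arbitrary value
        for (int i = 0; i < count; i++) {
            sum += values[i];
        }
        return sum;
    }

    /** Variant 2: an added loop whose condition never holds, so its body never runs. */
    static int withAddedBlock(int[] values) {
        int sum = 0;
        while (values.length < 0) { // added transplant: an array length is never negative
            sum = -1;
        }
        for (int v : values) {
            sum += v;
        }
        return sum;
    }
}
```

All three methods return the same value for any input array, so no test on the return value can distinguish the variants from the original.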

Replace transformations, which combine deletion and 
addition, always have the lowest sosiefication 
rate. We do not provide any graphical representation of 
this data, for space reasons. Yet, we make the 
following observations. First, picking a transplant and a 
transplantation point that are both method invocations is quite 
effective. This suggests the presence of alternative yet 
equivalent calls, which are discussed in the next section and 
also by Carzaniga et al. Second, we observe a certain plasticity 
around return statements: some of them can be replaced by 
the same statement surrounded by a try or a condition. This 
suggests the existence of similar statements in the 
neighbourhood of the transplantation point, which perform 
additional checks. 


Answer to RQ2: The addition of new statements is 
always the most effective way to produce new sosies. 
Deletion is most effective for some AST node types 
such as “continue”; to some extent, those AST nodes 
tend to be micro forgiving regions. This new knowledge 
is actionable for designing the next generation of sosie 
synthesizers, and may be leveraged for other 
diversification techniques. 


4.3 RQ3: What are the different kinds of good 
sosies that we can generate? 

With RQ1, we have seen that the sosiefication rate 
depends on the test suite execution signature. Now, we are 
interested in understanding whether there is a difference in 
nature between the sosies produced on low-tested 
transplantation points and those produced on highly tested 
transplantation points. 

For each program, we selected sosies among extreme cases: 
those synthesized on transplantation points covered by a 
single test case or synthesized on points covered by the highest 
number of test cases. By doing this, we are able to build a 
taxonomy of sosies. 

The manual analysis is the result of more than two full 
weeks of work, during which we manually analyzed dozens 
of sosies to investigate what kind of software diversity 
results from sosiefication. At a very coarse grain, before 
explaining them in detail, we distinguish three kinds of sosies: 
(i) revealer sosies indicate the presence of software 
plasticity in the code; (ii) fooler sosies are named after Cohen’s 
counter-measures for security; (iii) buggy sosies are made on 
transplantation points that are poorly specified by the test 
suite, so that the transformation simply introduces a bug. 

Revealer sosies take their name from the fact that 
they reveal something in the code that is otherwise implicit. 
In the context of software diversification, they reveal the 
presence of forgiving regions. Once those regions are 
revealed, a diversification algorithm can target them, with 
high confidence that the variant will be acceptable. 

Fooler sosies are so called in reference to the “garbage 
insertion” transformation proposed by Cohen. These 
sosies add garbage code that can fool attackers who look 
for specific instruction sequences. To this extent, 
sosiefication can be seen as a realization of Cohen’s 
transformation. 

Buggy sosies are simply the degenerate and uninteresting 
by-products of weak test cases. We will not 
provide a taxonomy of buggy sosies. 

In the following, we discuss categories of revealer and 
fooler sosies. For each category, we present a single 
archetypal example from the ones synthesized for this work (see 
the corresponding table). Each example illustrates the 
difference with respect to the original that produces a sosie. 
Examples come with a table that provides the values for the 
transplantation point features. 

A more complete set of examples is available online. 

Plastic specification. Some program regions implement 
behavior whose correctness is not binary. In other terms, 
there is no single correct value, but rather several 
ones. We call such a specification “plastic”. The regions of 
code implementing plastic specifications are extremely 
forgiving. They provide great opportunities for sosiefication, 
which transforms the programs in many ways while 
maintaining valuable and correct-enough functionality. 

One situation that we have encountered many times 
relates to the production of hash keys. Methods that produce 
these keys have a very plastic specification: they must 
return an integer value that can be used to identify an element. 
The only contract is that the function must be deterministic. 
Otherwise, there is no other constraint on the value of the 
hash key. Listing 3 illustrates an example of a sosie 
synthesized by removing a statement from a hash method. 
To us, the sosie still provides a perfectly valid functionality. 

Listing 3: Delete a statement in hash (commons.collection) 

int hash(final Object key) { 
  int h = key.hashCode(); 
  h += ~(h << 9); 
  h ^= h >>> 14; 
  h += h << 4; 
  h ^= h >>> 10; 
  return h; } 

#tc   transfo type   node type 
422   del            var declaration 
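
To make the plasticity of this contract tangible, the Java sketch below places the hash function of Listing 3 next to a variant with one mixing step removed. Which statement the sosie actually deletes is not recoverable from the listing, so the removed step chosen here is an assumption, as are the class and method names.

```java
/** Hypothetical sketch: a hash function and a variant that both satisfy the same contract. */
final class PlasticHashSketch {
    /** Hash function as printed in Listing 3. */
    static int hash(Object key) {
        int h = key.hashCode();
        h += ~(h << 9);
        h ^= h >>> 14;
        h += h << 4;
        h ^= h >>> 10;
        return h;
    }

    /** Variant without the step "h += h << 4" (an arbitrary choice made for this sketch). */
    static int hashVariant(Object key) {
        int h = key.hashCode();
        h += ~(h << 9);
        h ^= h >>> 14;
        h ^= h >>> 10;
        return h;
    }

    /** Bucket index for a table whose length is a power of two. */
    static int indexFor(int hashCode, int tableLength) {
        return hashCode & (tableLength - 1);
    }
}
```

Both functions are deterministic and map equal keys to equal values, which is all a hash map requires for correctness; only the distribution of keys over buckets, a non-functional property, may differ.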


Optimization. Some code is pure optimization, which makes 
it an ideal forgiving region for diversification. If one removes 
it, the output is still exactly the same; only non-functional 
properties such as performance are impacted. Listing 4 
shows an example of a sosie that removes an optimization: 
at the end of the if-block, the original program 
stores the value of buf in toString, which allows it to bypass 
the computation of buf the next time toString() is called; the 
sosie removes this part of the code, producing a potential 
performance degradation if the method is called intensively. 

Footnote: github.com/DIVERSIFY-project/sosie-dataset 

Listing 4: Delete a statement in toString (commons.lang) 

String toString() { 
  String result = toString; 
  if (result == null) { 
    final StringBuilder buf = new StringBuilder(32); 
    ... compute buf 
    result = buf.toString(); 
-   toString = result; 
  } 
  return result; } 

#tc   transfo type   node type 
2     del            stmt list 
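
The effect of removing such a cache can be reproduced with a self-contained Java sketch. The class is hypothetical and only mimics the structure of Listing 4; the computations counter exists solely to make the performance difference observable.

```java
/** Hypothetical value class illustrating a cache used as a pure optimization. */
final class CachedToStringSketch {
    private final int min;
    private final int max;
    private String cached;   // plays the role of the toString field in Listing 4
    int computations = 0;    // instrumentation for this sketch only

    CachedToStringSketch(int min, int max) {
        this.min = min;
        this.max = max;
    }

    /** Original behavior: compute once, then reuse the stored string. */
    String describeWithCache() {
        String result = cached;
        if (result == null) {
            result = compute();
            cached = result;
        }
        return result;
    }

    /** Variant: the statement that stores the result has been deleted. */
    String describeWithoutCache() {
        String result = cached;
        if (result == null) {
            result = compute();
        }
        return result;
    }

    private String compute() {
        computations++;
        final StringBuilder buf = new StringBuilder(32);
        buf.append('[').append(min).append("..").append(max).append(']');
        return buf.toString();
    }
}
```

Both methods return the same string on every call; the variant simply recomputes it each time, which is the performance degradation described above.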


Code redundancy. It sometimes happens that the very 
same computation is performed several times in the same 
program. For instance, two subsequent calls to list.remove(o), 
even separated by other instructions, are equivalent (as long 
as list and o do not change in between). Sosiefication 
naturally exploits this computation redundancy through the 
removal or replacement of these redundant statements. 
Replacement with side-effect free statements also produces 
valid sosies. 

Listing 5 displays an example of such a sosie (removing 
an if-block). The statement if (isEmpty(padStr)) 
padStr = SPACE; assigns a value to padStr, then this 
variable is passed to methods leftPad and rightPad. Yet, 
each of these two methods includes the exact same 
statement, which will eventually assign a value to padStr. So, 
the statement is redundant and can be removed from the 
original program, yielding a valid fooler sosie. Compared to 
sosies that remove some optimization, those sosies might 
perform better than the original program. 

Listing 5: Delete in center (commons.lang) 

String center(String str, final int size, String padStr) { 
  if (str == null || size <= 0) {return str;} 
- if (isEmpty(padStr)) {padStr = SPACE;} 
  str = leftPad(str, strLen + pads / 2, padStr); 
  str = rightPad(str, size, padStr); 
  return str; } 

transfo type node type 

-dil-il- 


Implementation redundancy. It often happens that 
programs embed several different functions that provide the 
same service in different ways. For example, there can exist 
several versions of the same method with different sets of 
parameters, which can be used interchangeably by providing 
suitable parameter values. It is also possible to use libraries 
that provide this diversity of similar methods (as 
demonstrated by Carzaniga and colleagues). Listing 6 
illustrates the exploitation of such implementation redundancy 
inside the program through a replace transformation: 
((Object[]) object)[i] has the same behavior as 
Array.get(object, i), with a completely different 
implementation. 

Listing 6: Replace in get (commons.collection) 

Object get(final Object object, final int index) { 

  else if (object instanceof Object[]) { 
-   return ((Object[]) object)[i]; 
+   try { 
+     return Array.get(object, i); 
+   } catch (final IllegalArgumentException ex) { 
+     throw new IllegalArgumentException("Unsupported 
+         object type: " + object.getClass().getName()); 
+   } 
  } 
} 

#tc   transfo type   node type 
1     rep            return 
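
The equivalence exploited by this sosie can be checked in isolation with the Java sketch below. The class and method names are hypothetical; java.lang.reflect.Array.get is the standard reflective accessor.

```java
import java.lang.reflect.Array;

/** Hypothetical sketch: two interchangeable ways of reading an array element. */
final class ArrayAccessSketch {
    /** Original form: direct cast followed by indexing. */
    static Object byCast(Object object, int i) {
        return ((Object[]) object)[i];
    }

    /** Variant form: reflective access, wrapped as in Listing 6. */
    static Object byReflection(Object object, int i) {
        try {
            return Array.get(object, i);
        } catch (final IllegalArgumentException ex) {
            throw new IllegalArgumentException(
                    "Unsupported object type: " + object.getClass().getName());
        }
    }
}
```

On any Object[] the two methods return the same element. They differ only outside the guarded branch: on a primitive array such as int[], the cast fails with a ClassCastException while the reflective form returns a boxed value. Because the transplantation point sits behind an instanceof Object[] check, this difference is never exercised there.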


Optional functionality. In software, not all parts are of 
equal importance. Some parts represent the core 
functionality, other parts are about options and are not essential 
to the computation. Those optional parts are either not 
specified or their specification is of less importance. These 
are areas that can be safely removed or replaced while still 
producing useful variants. Listing 7 is an example of a sosie 
that exploits such optional functionality. The sosie 
completely removes the body of the method, which is supposed 
to transform the type passed as parameter into an 
equivalent version that is serializable, and instead returns the 
parameter. The sosie is covered by 624 different test cases and 
is executed 6000 times; all executions complete successfully 
and all assertions in the test cases are satisfied. This is 
an example of an advanced feature implemented in the core 
part of GSon that is not necessary to make the library run 
correctly. 

Fooler sosies. 

We have realized that a number of “add” and “replace” 
transformations result in sosies which have more code than 
the original and where the additional code is harmless for 
the overall execution. These sosies act exactly as Cohen’s 
“garbage insertion” strategy to fool malicious attackers, hence 
we call them fooler sosies. 

We found multiple kinds of fooler sosies: some add branches 
in the code, redundant method calls or redundant 
sequences of method calls. Others reduce the legitimate 
input space through additional checks on input parameters. 
Listing 8 is an example of a fooler sosie, which adds a 
recursive call to ensureCapacity() (line 12). This could turn the 
method into an infinite recursion, except that in the 
additional recursive call, the value of the parameter is such that 
the condition of the first if-statement always holds true and 
the method execution immediately stops. The additional 
call adds a harmless method call in the execution flow. 
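
The termination argument can be checked on a simplified Java sketch. It is not the Commons Collections code: the class is hypothetical, rehashing of existing entries is omitted, and we assume a load factor of at most 1 so that the threshold never exceeds the capacity.

```java
/** Hypothetical, simplified map skeleton reproducing the added recursive call. */
final class FoolerRecursionSketch {
    private Object[] data = new Object[16];
    private final float loadFactor = 0.75f; // assumption: load factor <= 1
    private int threshold = 12;
    int calls = 0;                           // instrumentation for this sketch only

    void ensureCapacity(final int newCapacity) {
        calls++;
        final int oldCapacity = data.length;
        if (newCapacity <= oldCapacity) {
            return;                          // the added call always ends here
        }
        threshold = (int) (newCapacity * loadFactor);
        data = new Object[newCapacity];      // the real code also rehashes existing entries
        ensureCapacity(threshold);           // added transplant: threshold <= data.length
    }

    int capacity() {
        return data.length;
    }
}
```

After a resize, threshold is at most the new capacity, so the recursive call hits the first guard and returns immediately: exactly one extra call per resize, and none when no resize is needed.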

Discussion. Let us now consider again the transplantation 
point features given for each sosie. Most sosies identified as 
buggy by our manual analysis are made on transplantation 
points covered by a single test case. In other words, the risk 
of synthesizing bad sosies increases when the number of test 
cases is low. 

More interestingly, we realized that valid revealer and 
fooler sosies can be found both on points intensively tested 
and on weakly tested points. This makes us conclude that 


Listing 7: Replace in canonicalize (GSon) 

public static Type canonicalize(Type type) { 
- if (type instanceof Class) { 
-   Class<?> c = (Class<?>) type; 
-   return c.isArray() ? new 
-       GenericArrayTypeImpl(canonicalize(c. 
-       getComponentType())) : c; 
- } 
- else 
- if (type instanceof ParameterizedType) { 
-   ParameterizedType p = (ParameterizedType) type; 
-   return new ParameterizedTypeImpl(p. 
-       getOwnerType(), 
-       p.getRawType(), p.getActualTypeArguments()); 
- } 
- else 
- if (type instanceof GenericArrayType) { 
-   GenericArrayType g = (GenericArrayType) type; 
-   return new GenericArrayTypeImpl(g. 
-       getGenericComponentType()); 
- } 
- else 
- if (type instanceof WildcardType) { 
-   WildcardType w = (WildcardType) type; 
-   return new WildcardTypeImpl(w.getUpperBounds(), 
-       w.getLowerBounds()); 
- } 
- else { 
-   return type; 
- } 
+ return type; 
} 

#tc   transfo type   node type 
623   rep            if 


if a region is intrinsically plastic (has a plastic specification 
or is optional), the number of test cases barely matters: the 
mere fact that the specification and the corresponding code 
region are plastic explains why we can easily synthesize 
sosies. This confirms a trend we observed in RQ1: 
no matter how much a region is tested, we can synthesize 
sosies because of some intrinsic forms of plasticity. 


Answer to RQ3: We have provided a first 
classification of software sosies, founded on the concepts of 
revealer, fooler and buggy sosies. The “revealers” 
indicate forgiving regions. The “foolers” are useful in 
a protection setting [9]. The buggy sosies are due to 
weak test cases. Our manual analysis shows the variety 
of roles that code plays in a program. It uncovers the 
multitude of opportunities that exist for sosie synthesis 
and diversification in real-world programs. 


4.4 Threats to Validity 

We performed a large scale experiment in a relatively 
unexplored domain: software diversification at the application 
code level. We now present the threats to validity. 

Our findings might not generalize to all types of 
applications. We selected frameworks and libraries because of their 
popularity, their longevity and the very high quality of their 
test suites. Yet, our observations about the large variations 
among statements, with respect to test coverage, and about 
code plasticity can be different when analyzing programs in 
other domains. 

Listing 8: Add in ensureCapacity (commons.collection) 

void ensureCapacity(final int newCapacity) { 
  final int oldCapacity = data.length; 
  if (newCapacity <= oldCapacity) { 
    return; 
  } 
  if (size == 0) { 
    threshold = calculateThreshold(newCapacity, loadFactor); 
    data = new HashEntry[newCapacity]; 
  } else { 

  } 
+ ensureCapacity(threshold); } 

#tc   transfo type   node type 
8     add            invocation 


Our large scale experiments rely on a complex tool chain, 
which integrates code transformation, instrumentation, trace 
analysis and statistical analysis. We also rely on the Grid5000 
grid infrastructure to run millions of transformations. We 
did extensive testing of our code transformation 
infrastructure, built on top of the Spoon framework, which has 
been developed, tested and maintained for more than 10 years. 
However, as for any large scale experimental infrastructure, 
there are surely bugs in this software. We hope that they 
only change marginal quantitative things, and not the 
qualitative essence of our findings. Our infrastructure is publicly 
available on GitHub. 


5. RELATED WORK 

As mentioned on several occasions in this paper, our work 
is related to the multiple investigations of Martin Rinard and 
his group about software tradeoffs between correctness and 
other properties such as security or performance. Rinard has 
defined the general concept of “acceptability envelope”, and 
explored its application in different domains. For example, 
his group injected off-by-one errors on loop termination 
conditions in order to characterize the behavior of two programs 
under errors; they also experimented with runtime loop 
perforation to explore the same envelope [23]. In all these 
cases, the authors use a set of test scenarios to assess the 
acceptability of the changes. Our work contributes to this 
body of knowledge about the nature of the acceptability 
envelope by investigating new kinds of transformations as well 
as a new analysis method to locate code regions that can 
tolerate changes. The set of revealer and fooler sosies for a 
given program can be considered as forming the body within 
the “acceptability envelope” of the program [18]. 

Mutational robustness is the ability of software to 
resist mutations. The essential difference between both 
works lies in the definition of program transformations: Schulte 
et al. use only random operations, while we use heuristics 
based on types and variable renaming. Also, where Schulte 
et al. say that software is robust to mutations, we say that we 
can synthesize diversity and that this indicates the presence of 
true plasticity in the code. 

The recent advances in software transplantation by Sidiroglou 
and colleagues and by Barr and colleagues are related to 
sosiefication. Both works transfer code from a donor 
program into recipient applications. Sidiroglou performs 
transplantation for bug fixing purposes and Barr does it to reuse 
functionality from one program to another. Sosiefication, 
especially the fooler sosies, can be seen as a form of internal 
micro transplantation. 

Footnote: github.com/DIVERSIFY-project/sosie-dataset 

The work of Langdon and Harman [14] defines an 
iterative process of code transformations and testing in order to 
speed up program execution. Schulte and colleagues use a 
similar process to reduce energy consumption of embedded 
programs. Works in the area of genetic improvement of 
programs are related to ours since they also rely on code 
transformations and test suites in order to automatically produce 
different versions of a program. Our analysis of statement 
execution signatures could also improve such approaches. 

Our investigations of software plasticity at the edge of 
correctness tradeoffs directly relate to seminal works that 
advocate for novel ways of building software that is more 
approximate and evolvable, but also less brittle. In particular, our 
work is very much inspired by the work of Richard Gabriel 
[11], Gerald Sussman and Mary Shaw. They all 
warn against the desire to build perfectly correct systems, 
which can only be correct in very specific conditions and 
are consequently very brittle outside these conditions. They 
advocate for new approaches that would support the 
construction of software systems that have the ability to evolve 
and adapt, in exchange for certain tradeoffs with respect to 
correctness. We foresee our investigations about automatic 
diversification of application source code as a contribution 
towards the design of such new approaches. 

6. CONCLUSION 

In this paper, we have presented an exploration in the 
area of software diversification. We have analyzed a 
specific diversification technique, sosiefication, in the light of 
the interactions between a test suite and the program under 
test. This investigation combined automated analysis with 
the manual exploration of a large sample of sosies. This 
enabled us to contribute to the body of knowledge on 
automatic software diversity as follows. First, we have shown 
the correlation between statement execution signatures and 
sosiefication, and we demonstrated that the sosiefication rate 
never goes down to zero, indicating a certain degree of 
intrinsic plasticity in any program. Second, we have provided 
novel pieces of evidence about the presence and the nature 
of forgiving regions in software. Third, we demonstrated 
the effectiveness of code addition and deletion to synthesize 
sosies that can contribute to previous work on OS protection 
by Cohen and failure oblivious computing by Rinard. 

As future work, we wish to exploit these findings in order 
to automate the synthesis of variants that establish 
tradeoffs between functional correctness and other qualities such 
as performance. We believe that software developers must 
constantly take into account a wide variety of concerns in 
the code that goes into production and, to this extent, they 
must constantly make multi-criteria decisions. Eventually 
they deliver a product that is a single point on the Pareto 
front of all possible solutions that can satisfy the same 
requirements. We want to exploit sosiefication and other 
diversification techniques as a way to automatically explore the 
neighbourhood on this Pareto front. 